Author = "Aaron Stephenson"
ASUid = "1222366145"
#Success in the Video Game Industry
The gaming industry generates billions of dollars in revenue each year. It is often very difficult to stay ahead of the competition when it comes to what will be popular and when. Game developers and publishers have to know if a game will perform well before investing significant resources into its development and marketing. The problem that we want to tackle is to predict the success a video game will have in the market. The "Video Game Sales" dataset, which was found on Kaggle, will be used for modeling purposes. This dataset has games from 1980-2016, including data on the games's name, platform, release year, genre, publisher, and global sales in millions of units. The goal is to build a machine learning model that can predict the success of a video game based on its characteristics. This type of information could help game developers and publishers make informed decisions about which games to invest vast resources into. With proper modeling it can help optimize their marketing and distribution strategies.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import tensorflow as tf
import sklearn
#confusion_matrix and accuracy_score may come in handy
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
#KNN Classifier and Regression models
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression
#functions to split and scale our data
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
#Functions to test my findings
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.metrics import mean_squared_error, r2_score
#Allows us to exhaustively search for best hyperparameter using cross-validation
from sklearn.model_selection import GridSearchCV
from keras.wrappers.scikit_learn import KerasRegressor
from keras.models import Sequential
from keras.layers import Dense
To begin, we loaded the video game sales dataset, which contains information on video games released from 1980 to 2016, including data on the game's name, platform, release year, genre, publisher, and global sales in millions of units. Before proceeding with any analysis, we cleaned the data by dropping rows with missing values and renaming the "Year_of_Release" column to "Year" for ease of use.
df = pd.read_csv('Game_Success.csv')
# Clean the dataset by dropping missing values
df = df.dropna()
#Rename column for easier future use
df = df.rename(columns = {"Year_of_Release": "Year"})
df
| Name | Year | Genre | Publisher | NA_Sales | EU_Sales | JP_Sales | Other_Sales | Global_Sales | Critic_Score | Critic_Count | User_Score | User_Count | Developer | Rating | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | .hack//Infection Part 1 | 2002 | Role-Playing | Atari | 0.49 | 0.38 | 0.26 | 0.13 | 1.27 | 75 | 35 | 8.5 | 60 | CyberConnect2 | T |
| 1 | .hack//Mutation Part 2 | 2002 | Role-Playing | Atari | 0.23 | 0.18 | 0.20 | 0.06 | 0.68 | 76 | 24 | 8.9 | 81 | CyberConnect2 | T |
| 2 | .hack//Outbreak Part 3 | 2002 | Role-Playing | Atari | 0.14 | 0.11 | 0.17 | 0.04 | 0.46 | 70 | 23 | 8.7 | 19 | CyberConnect2 | T |
| 3 | [Prototype] | 2009 | Action | Activision | 0.84 | 0.35 | 0.00 | 0.12 | 1.31 | 78 | 83 | 7.8 | 356 | Radical Entertainment | M |
| 4 | [Prototype] | 2009 | Action | Activision | 0.65 | 0.40 | 0.00 | 0.19 | 1.24 | 79 | 53 | 7.7 | 308 | Radical Entertainment | M |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 6889 | Zubo | 2008 | Misc | Electronic Arts | 0.08 | 0.02 | 0.00 | 0.01 | 0.11 | 75 | 19 | 7.6 | 75 | EA Bright Light | E10+ |
| 6890 | Zumba Fitness | 2010 | Sports | 505 Games | 1.74 | 0.45 | 0.00 | 0.18 | 2.37 | 42 | 10 | 5.5 | 16 | Pipeworks Software, Inc. | E |
| 6891 | Zumba Fitness: World Party | 2013 | Misc | Majesco Entertainment | 0.17 | 0.05 | 0.00 | 0.02 | 0.24 | 73 | 5 | 6.2 | 40 | Zoe Mode | E |
| 6892 | Zumba Fitness Core | 2012 | Misc | 505 Games | 0.00 | 0.05 | 0.00 | 0.00 | 0.05 | 77 | 6 | 6.7 | 6 | Zoe Mode | E10+ |
| 6893 | Zumba Fitness Rush | 2012 | Sports | 505 Games | 0.00 | 0.16 | 0.00 | 0.02 | 0.18 | 73 | 7 | 6.2 | 5 | Majesco Games, Majesco | E10+ |
6825 rows × 15 columns
We then proceeded with exploratory data analysis, starting with the distribution of the video game sales data. We plotted a histogram of the global sales, which showed that the majority of games have sales less than 5 million units. There is a long tail in the distribution, indicating that a few games have sold very well, potentially skewing the data. We also plotted a bar chart of the top 10 best-selling video games, with "Wii Sports" being the highest-selling game with over 80 million units sold.
Next, we examined the video game market by year, looking at the number of games released and the total global sales per year. The plot of the number of games released showed a steady increase in the number of games over time, with a sharp increase from the 2000s onwards. The plot of total global sales per year showed a similar trend, with a sharp increase in sales from the mid-90s to the mid-2000s and a decline thereafter.
Moving on, I plotted bar charts for the top 10 genres, with "Action" being the top genre and "Nintendo" being the top publisher. We also created scatterplots to visualize the relationship between global sales and other features such as critic and user scores, finding a weak positive correlation between global sales and both critic and user scores.
# Get summary statistics of the dataset
df.info()
# Show the number of missing values in each column
print(df.isnull().sum())
# Plot a bar plot of the total number of games in each genre
plt.figure(figsize=(10,6))
sns.countplot(data=df, x='Genre')
plt.title('Count of Each Genre')
plt.xticks(rotation=45)
plt.show()
#Set the plot style to block
sns.set_style('dark')
#Create a scatter plot of critic scores vs. user scores, with color based on genre
plt.figure(figsize=(10,6))
sns.scatterplot(data=df, x='Critic_Score', y='User_Score', hue='Genre', alpha=0.7, palette='bright')
plt.title('Critic Scores vs. User Scores by Genre')
plt.xlabel('Critic Scores')
plt.ylabel('User Scores')
#Add a legend
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.show()
<class 'pandas.core.frame.DataFrame'> Int64Index: 6825 entries, 0 to 6893 Data columns (total 15 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Name 6825 non-null object 1 Year 6825 non-null int64 2 Genre 6825 non-null object 3 Publisher 6825 non-null object 4 NA_Sales 6825 non-null float64 5 EU_Sales 6825 non-null float64 6 JP_Sales 6825 non-null float64 7 Other_Sales 6825 non-null float64 8 Global_Sales 6825 non-null float64 9 Critic_Score 6825 non-null int64 10 Critic_Count 6825 non-null int64 11 User_Score 6825 non-null float64 12 User_Count 6825 non-null int64 13 Developer 6825 non-null object 14 Rating 6825 non-null object dtypes: float64(6), int64(4), object(5) memory usage: 853.1+ KB Name 0 Year 0 Genre 0 Publisher 0 NA_Sales 0 EU_Sales 0 JP_Sales 0 Other_Sales 0 Global_Sales 0 Critic_Score 0 Critic_Count 0 User_Score 0 User_Count 0 Developer 0 Rating 0 dtype: int64
import plotly.express as px
# Group the dataset by year and genre, count the number of games in each group
games_by_year_genre = df.groupby(['Year', 'Genre']).size().reset_index(name='Count')
# Create the interactive bar plot
fig = px.bar(games_by_year_genre, x='Year', y='Count', color='Genre', title='Number of Games Released by Year and Genre',
labels={'Count':'Number of Games Released'}, hover_name='Genre')
# Display the plot
fig.show()
top_selling = df[['Name', 'Global_Sales']].groupby('Name').sum().sort_values(by='Global_Sales', ascending=False).head(10)
plt.figure(figsize=(12,6))
plt.barh(top_selling.index[::-1], top_selling['Global_Sales'][::-1])
plt.title('Top 10 Best-Selling Games of All Time')
plt.xlabel('Global Sales (millions)')
plt.show()
Overall, our exploratory data analysis provided insights into the distribution of global video game sales, the video game market trends over time, and the relationship between video game sales and various game characteristics. These insights will guide our subsequent feature engineering and machine learning model building.
#Create a correlation matrix
corr_matrix = df.corr()
#Set the figure size
fig, ax = plt.subplots(figsize=(16, 10))
#Plot a heatmap of the correlation matrix
sns.heatmap(corr_matrix, cmap='coolwarm', annot=True, ax=ax)
plt.title('Correlation Matrix for Video Game Features')
plt.show()
By examining the plot, we can see that the Global_Sales are positively correlated with the Critic_Score, User_Score, and Year features. This suggests that games with higher critic and user scores tend to sell better, as well as more recent releases. On the other hand, there is a weak negative correlation between Global_Sales and the Rank feature, indicating that games with lower ranks tend to sell better.
It's also interesting to see that the correlation between the sales in different regions (NA_Sales, EU_Sales, JP_Sales, Other_Sales) and Global_Sales is relatively strong, which is expected since global sales are a sum of sales in different regions.
Overall, this correlation matrix can give us some insight into which features may be important in predicting the success of a video game, and can help guide the feature selection process for our machine learning models.
from sklearn.pipeline import make_pipeline
#Select the features and target variable
X = df[['Critic_Score', 'User_Score']].values
y = df['Global_Sales'].values
#Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
#Create a pipeline with a StandardScaler and LinearRegression
model = make_pipeline(StandardScaler(), LinearRegression())
#Fit the pipeline to the training data
model.fit(X_train, y_train)
#Make predictions on the test set
y_pred = model.predict(X_test)
#Calculate and print the mean squared error and R-squared score
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print('Mean Squared Error: ', mse)
print('R-squared score: ', r2)
Mean Squared Error: 2.6931840681914085 R-squared score: 0.08383146751958037
'''The mean squared error (MSE) of the linear regression model is 2.693, which
means that on average, the model's predicted global sales value is off by
around 1.64 million units squared. The R-squared score of the model is 0.084,
which indicates that only 8.4% of the variability in the global sales of
video games can be explained by the model's predictor variables (critic score
and user score).'''
"The mean squared error (MSE) of the linear regression model is 2.693, which\nmeans that on average, the model's predicted global sales value is off by\naround 1.64 million units squared. The R-squared score of the model is 0.084,\nwhich indicates that only 8.4% of the variability in the global sales of\nvideo games can be explained by the model's predictor variables (critic score\nand user score)."
We are now preparing the dataset for our machine learning model. The first step is to create dummy variables for the 'Genre' column using the get_dummies() method from Pandas. This allows us to convert categorical data into numerical data, which is necessary for many machine learning algorithms. The Genre column is then dropped and 'concat()' is used to merge the dummy variables for the remaining of the dataset.
Next, define the input and output variables for our model. The input variables are the attributes that will be used to predict the output variable. In this case, the input variables are critic_score, user_score, developer, rating, year, and genre. The output variable is global_sales. Finally, split the dataset into training and testing sets using the train_test_split() method from Scikit-learn.
dumdum = pd.get_dummies(df['Genre'])
df = df.drop('Genre', axis=1)
df = pd.concat([df, dumdum], axis=1)
#Define the input and output variables
X = df.iloc[:, 5:11].values # attributes: critic_score, user_score, developer, rating, year, genre
y = df.iloc[:, -1].values # output variable: global_sales
#Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
After splitting the data, we scale the input variables using StandardScaler(). Then, we build a k-nearest neighbor regression model using KNeighborsRegressor() with 5 neighbors. We evaluate the model's performance on the testing set and print the KNN score. The GridSearchCV() is used to search for the best hyperparameters for the KNN classifier model. The param_grid is set to different values and weight function. The best param score is then printed at the bottom, in order to possibly use later.
#Scale the input variables
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
#Build the k-nearest neighbor model
knn = KNeighborsRegressor(n_neighbors=5)
knn.fit(X_train, y_train)
#Evaluate the k-nearest neighbor model
knn_score = knn.score(X_test, y_test)
print('KNN Score:', knn_score)
#Set up the parameter grid for GridSearchCV
param_grid = {'n_neighbors': [3, 5, 7, 9, 11], 'weights': ['uniform', 'distance']}
#Create a KNN classifier object
knn = KNeighborsClassifier()
#Set up the GridSearchCV object
grid = GridSearchCV(knn, param_grid, cv=5)
#Fit the GridSearchCV object to the data
grid.fit(X_train, y_train)
#Print the best hyperparameters and the corresponding score
print("Best Hyperparameters:", grid.best_params_)
print("Best Score:", grid.best_score_)
KNN Score: -0.10310006209323208
Best Hyperparameters: {'n_neighbors': 11, 'weights': 'uniform'}
Best Score: 0.9630036630036629
#Set up the parameter grid for GridSearchCV
param_grid = {'n_neighbors': [3, 5, 7, 9, 11], 'weights': ['uniform', 'distance']}
#Create a KNN classifier object
knn = KNeighborsClassifier()
#Set up the GridSearchCV object
grid = GridSearchCV(knn, param_grid, cv=5)
#Fit the GridSearchCV object to the data
grid.fit(X_train, y_train)
#Print the best hyperparameters and the corresponding score
print("Best Hyperparameters:", grid.best_params_)
print("Best Score:", grid.best_score_)
#Build the k-nearest neighbor model with the best hyperparameters
knn = KNeighborsClassifier(n_neighbors=11, weights='uniform')
knn.fit(X_train, y_train)
#Evaluate the k-nearest neighbor model
knn_score = knn.score(X_test, y_test)
print('KNN Score:', knn_score)
Best Hyperparameters: {'n_neighbors': 11, 'weights': 'uniform'}
Best Score: 0.9630036630036629
KNN Score: 0.9509157509157509
In this section, a neural network model is created to predict the success of a video game based on its characteristics. First, defined the model using the Keras. The model has two layers, the first with 10 neurons and the input dimension of 6, and the second with 1 neuron and a linear activation function. Then model was compiled with the mean squared error loss function and the Adam optimizer. The model was trained using the fit() function with the training data, 50 epochs, and a batch size of 16. After training, we evaluated the model on the test set using the evaluate() function with the mean squared error as the metric.
#Build the neural network
model = Sequential()
model.add(Dense(10, input_dim=6, activation='relu'))
model.add(Dense(1, activation='linear'))
#Compile the neural network
model.compile(loss='mse', optimizer='adam', metrics=['mse'])
#Train the neural network
model.fit(X_train, y_train, epochs=50, batch_size=16, verbose=0)
#Evaluate the neural network
nn_score = model.evaluate(X_test, y_test, verbose=0)[1]
print('Neural Network Score:', nn_score)
Neural Network Score: 0.04527967795729637
The neural network has a mean squared error (MSE) score of 0.04 on the test set. This means that on average, the predicted global sales of a video game are off by 0.04 squared units (e.g. if global sales are measured in millions of units, then the MSE is in millions squared). The lower the MSE score, the better the model's performance in predicting the target variable. The neural network models score that indicates it is a suitable model for predicting the success of a video game based on its characteristics.
from sklearn.metrics import roc_curve, auc
y_pred = np.round(model.predict(X_test))
cm_test = confusion_matrix(y_test, y_pred)
#Get the predicted probabilities for the training and testing datasets
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)
#Create AUC train
fpr_train, tpr_train, thresholds_train = roc_curve(y_train, y_train_pred)
auc_train = auc(fpr_train, tpr_train)
#Create AUC test
fpr_test, tpr_test, thresholds_test = roc_curve(y_test, y_test_pred)
auc_test = auc(fpr_test, tpr_test)
print("AUC Score: ", auc_test)
43/43 [==============================] - 0s 572us/step 171/171 [==============================] - 0s 530us/step 43/43 [==============================] - 0s 572us/step AUC Score: 0.7608490674516476
An ROC evaluation was created to test the reliability of the neurel network model. The ROC graphics show that the neurel network is a good but not perfict fit. The curve shows that this model is able prevent false positives but may not always hit the true positive mark. If the AUC for the training dataset is significantly higher than the AUC for the testing dataset, it may indicate that the model is overfitting to the training data and not generalizing well to new data.
# your code here
#Plot AUC train
plt.plot(fpr_train, tpr_train, label='Train AUC = {:.3f}'.format(auc_train))
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve - Training Set')
plt.legend(loc='lower right')
plt.show()
#Plot AUC test
plt.plot(fpr_test, tpr_test, label='Test AUC = {:.3f}'.format(auc_test))
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve - Testing Set')
plt.legend(loc='lower right')
plt.show()
from sklearn.model_selection import KFold, cross_val_score
# Define the number of folds for cross-validation
num_folds = 5
# Define the K-fold cross-validator
kfold = KFold(n_splits=num_folds, shuffle=True)
# Define a list to store the cross-validation scores
scores = []
# Loop through each fold
for train_index, test_index in kfold.split(X):
# Split the dataset into training and testing sets
X_train, X_test = X[train_index], X[test_index]
y_train, y_test = y[train_index], y[test_index]
# Build the neural network
model = Sequential()
model.add(Dense(10, input_dim=6, activation='relu'))
model.add(Dense(1, activation='linear'))
# Compile the neural network
model.compile(loss='mse', optimizer='adam', metrics=['mse'])
# Train the neural network
model.fit(X_train, y_train, epochs=50, batch_size=16, verbose=0)
# Evaluate the neural network on the testing set
scores.append(model.evaluate(X_test, y_test, verbose=0)[1])
# Calculate the mean and standard deviation of the scores
mean_score = np.mean(scores)
std_score = np.std(scores)
print('Mean score:', mean_score)
print('Standard deviation:', std_score)
Mean score: 0.06395705863833427 Standard deviation: 0.034344017608940534
The mean score and standard deviation for the k-fold cross-validation of the neural network are relatively low. These values suggest that the performance of the neural network is consistent across different folds and that the model is reasonably accurate in predicting the global sales of video games based on their features. However, it is important to note that the performance of the model may vary depending on the specific dataset and the distribution of the features.
Using the neural network created above, we can predict which genre of game would produce the largest global sale. We start with a fresh dataset named genre_df and make it a copy of our df. Slide gere_df into the neural network model from above to see what the predicted highest global sales genre will be.
This portion of code is only being used to test what the model predicted vs the data set averages. The purpose is to see ensure the the predicted values were not just following the data history. Gives a solid comparison of the predicted values of the model and the history of data that was collected.
genre_columns = ['Action', 'Adventure', 'Fighting', 'Misc', 'Platform', 'Puzzle', 'Racing', 'Role-Playing', 'Shooter', 'Simulation', 'Sports', 'Strategy']
for genre in genre_columns:
genre_df = df[df[genre] == 1]
print(genre)
print('Mean Critic Score:', genre_df['Critic_Score'].mean())
print('Mean User Score:', genre_df['User_Score'].mean())
print('Mean Global Sales:', genre_df['Global_Sales'].mean())
print()
genre_mean_sales = df.groupby(genre_columns)['Global_Sales'].mean()
max_genre = genre_mean_sales.idxmax()
min_genre = genre_mean_sales.idxmin()
max_genre_name = ", ".join([genre_columns[i] for i in range(len(genre_columns)) if max_genre[i] == 1])
min_genre_name = ", ".join([genre_columns[i] for i in range(len(genre_columns)) if min_genre[i] == 1])
print("Genre with highest mean global sales: ", max_genre_name)
print("Genre with lowest mean global sales: ", min_genre_name)
Action Mean Critic Score: 67.82883435582822 Mean User Score: 7.095828220858897 Mean Global Sales: 0.7381349693251518 Adventure Mean Critic Score: 66.13306451612904 Mean User Score: 7.160887096774193 Mean Global Sales: 0.3256048387096776 Fighting Mean Critic Score: 69.73280423280423 Mean User Score: 7.3018518518518505 Mean Global Sales: 0.6612433862433865 Misc Mean Critic Score: 67.4609375 Mean User Score: 6.8497395833333306 Mean Global Sales: 1.0840104166666664 Platform Mean Critic Score: 70.0 Mean User Score: 7.377171215880896 Mean Global Sales: 0.9374689826302726 Puzzle Mean Critic Score: 70.69491525423729 Mean User Score: 7.2508474576271205 Mean Global Sales: 0.6686440677966101 Racing Mean Critic Score: 69.54388984509467 Mean User Score: 7.104302925989677 Mean Global Sales: 0.8196557659208258 Role-Playing Mean Critic Score: 72.82022471910112 Mean User Score: 7.618539325842696 Mean Global Sales: 0.704171348314606 Shooter Mean Critic Score: 70.98148148148148 Mean User Score: 7.086458333333339 Mean Global Sales: 0.9449999999999973 Simulation Mean Critic Score: 69.96969696969697 Mean User Score: 7.1966329966329985 Mean Global Sales: 0.6824915824915825 Sports Mean Critic Score: 74.17073170731707 Mean User Score: 7.11081654294804 Mean Global Sales: 0.8842523860021196 Strategy Mean Critic Score: 73.12359550561797 Mean User Score: 7.352808988764042 Mean Global Sales: 0.26071161048689145 Genre with highest mean global sales: Misc Genre with lowest mean global sales: Strategy
# Create a new dataframe
genre_df = df
# Use the trained neural network to predict global sales for each genre
genre_df['Predicted_Global_Sales'] = model.predict(genre_df.iloc[:, 5:11].values)
# Find the column with the highest predicted global sales
max_sales_col = genre_df.iloc[:, -13:-1].columns[np.argmax(genre_df.iloc[:, -13:-1].values)]
print('The genre with the highest predicted global sales is:', max_sales_col)
214/214 [==============================] - 0s 517us/step The genre with the highest predicted global sales is: Role-Playing
Based on the analysis of the video game sales data, it can be concluded that the Role-Playing genre is predicted to have the highest mean global sales, followed by the Shooter genre. On the other hand, the Adventure genre is predicted to have the lowest mean global sales, followed by the Puzzle genre.
It is important to note that these predictions are based on historical data and market trends and are subject to change in the future. However, the insights gained from this analysis can be useful for game developers and publishers in making informed decisions about which genres to focus on and invest in.
Overall, the Role-Playing genre appears to be a promising choice for developers and publishers, given its strong track record of global sales.